Marks: 60
The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
pip install yellowbrick
Requirement already satisfied: yellowbrick in c:\users\ak153\anaconda3\lib\site-packages (1.5) Requirement already satisfied: matplotlib!=3.0.0,>=2.0.2 in c:\users\ak153\anaconda3\lib\site-packages (from yellowbrick) (3.3.4) Requirement already satisfied: scikit-learn>=1.0.0 in c:\users\ak153\anaconda3\lib\site-packages (from yellowbrick) (1.3.0) Requirement already satisfied: scipy>=1.0.0 in c:\users\ak153\anaconda3\lib\site-packages (from yellowbrick) (1.6.2) Requirement already satisfied: cycler>=0.10.0 in c:\users\ak153\anaconda3\lib\site-packages (from yellowbrick) (0.10.0) Requirement already satisfied: numpy>=1.16.0 in c:\users\ak153\anaconda3\lib\site-packages (from yellowbrick) (1.20.1) Requirement already satisfied: six in c:\users\ak153\anaconda3\lib\site-packages (from cycler>=0.10.0->yellowbrick) (1.15.0) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in c:\users\ak153\anaconda3\lib\site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (2.4.7) Requirement already satisfied: python-dateutil>=2.1 in c:\users\ak153\anaconda3\lib\site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (2.8.1) Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\ak153\anaconda3\lib\site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (1.3.1) Requirement already satisfied: pillow>=6.2.0 in c:\users\ak153\anaconda3\lib\site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (8.2.0) Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\ak153\anaconda3\lib\site-packages (from scikit-learn>=1.0.0->yellowbrick) (2.1.0) Requirement already satisfied: joblib>=1.1.1 in c:\users\ak153\anaconda3\lib\site-packages (from scikit-learn>=1.0.0->yellowbrick) (1.3.1) Note: you may need to restart the kernel to use updated packages.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to perform PCA
from sklearn.decomposition import PCA
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
#format numeric data for easier readability
pd.set_option(
"display.float_format", lambda x: "%.2f" % x
) # to display numbers rounded off to 2 decimal places
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To supress warnings
import warnings
warnings.filterwarnings("ignore")
# Importing andreading csv file into a pandas Dataframe
data = pd.read_csv('C:/Users/AK153/GREAT_LEARNING/Projects/MODULE 7 UNSUPERVISED LEARNING- TRADE&AHEAD/stock_data.csv')
# copying data to another varaible to preserve original data
df = data.copy()
#Checking Shape of the data
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")
There are 340 rows and 15 columns.
# print a sample of 10 rows randomly selected from the training data
df.sample(n=10,random_state=1)
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102 | DVN | Devon Energy Corp. | Energy | Oil & Gas Exploration & Production | 32.00 | -15.48 | 2.92 | 205 | 70 | 830000000 | -14454000000 | -35.55 | 406582278.50 | 93.09 | 1.79 |
| 125 | FB | Information Technology | Internet Software & Services | 104.66 | 16.22 | 1.32 | 8 | 958 | 592000000 | 3669000000 | 1.31 | 2800763359.00 | 79.89 | 5.88 | |
| 11 | AIV | Apartment Investment & Mgmt | Real Estate | REITs | 40.03 | 7.58 | 1.16 | 15 | 47 | 21818000 | 248710000 | 1.52 | 163625000.00 | 26.34 | -1.27 |
| 248 | PG | Procter & Gamble | Consumer Staples | Personal Products | 79.41 | 10.66 | 0.81 | 17 | 129 | 160383000 | 636056000 | 3.28 | 491391569.00 | 24.07 | -2.26 |
| 238 | OXY | Occidental Petroleum | Energy | Oil & Gas Exploration & Production | 67.61 | 0.87 | 1.59 | 32 | 64 | -588000000 | -7829000000 | -10.23 | 765298142.70 | 93.09 | 3.35 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.52 | -8.70 | 1.48 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 435353535.40 | 17.68 | -3.84 |
| 112 | EQT | EQT Corporation | Energy | Oil & Gas Exploration & Production | 52.13 | -21.25 | 2.36 | 2 | 201 | 523803000 | 85171000 | 0.56 | 152091071.40 | 93.09 | 9.57 |
| 147 | HAL | Halliburton Co. | Energy | Oil & Gas Equipment & Services | 34.04 | -5.10 | 1.97 | 4 | 189 | 7786000000 | -671000000 | -0.79 | 849367088.60 | 93.09 | 17.35 |
| 89 | DFS | Discover Financial Services | Financials | Consumer Finance | 53.62 | 3.65 | 1.16 | 20 | 99 | 2288000000 | 2297000000 | 5.14 | 446887159.50 | 10.43 | -0.38 |
| 173 | IVZ | Invesco Ltd. | Financials | Asset Management & Custody Banks | 33.48 | 7.07 | 1.58 | 12 | 67 | 412000000 | 968100000 | 2.26 | 428362831.90 | 14.81 | 4.22 |
# Checking The Column Names And Datatype
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
# Checking for Duplicate Values
df.duplicated().sum()
0
# Checking For Misiing Values
df.isnull().sum()
Ticker Symbol 0 Security 0 GICS Sector 0 GICS Sub Industry 0 Current Price 0 Price Change 0 Volatility 0 ROE 0 Cash Ratio 0 Net Cash Flow 0 Net Income 0 Earnings Per Share 0 Estimated Shares Outstanding 0 P/E Ratio 0 P/B Ratio 0 dtype: int64
# Convert all columns with dtype object into category
for col in df.columns[df.dtypes=='object']:
df[col]=df[col].astype('category')
# dropping the ticker symbol column, as it does not provide any information
df.drop("Ticker Symbol",axis=1,inplace=True)
# Confirm New Dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Security 340 non-null category 1 GICS Sector 340 non-null category 2 GICS Sub Industry 340 non-null category 3 Current Price 340 non-null float64 4 Price Change 340 non-null float64 5 Volatility 340 non-null float64 6 ROE 340 non-null int64 7 Cash Ratio 340 non-null int64 8 Net Cash Flow 340 non-null int64 9 Net Income 340 non-null int64 10 Earnings Per Share 340 non-null float64 11 Estimated Shares Outstanding 340 non-null float64 12 P/E Ratio 340 non-null float64 13 P/B Ratio 340 non-null float64 dtypes: category(3), float64(7), int64(4) memory usage: 46.7 KB
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Security | 340 | 340 | 3M Company | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sector | 340 | 11 | Industrials | 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Current Price | 340.00 | NaN | NaN | NaN | 80.86 | 98.06 | 4.50 | 38.55 | 59.70 | 92.88 | 1274.95 |
| Price Change | 340.00 | NaN | NaN | NaN | 4.08 | 12.01 | -47.13 | -0.94 | 4.82 | 10.70 | 55.05 |
| Volatility | 340.00 | NaN | NaN | NaN | 1.53 | 0.59 | 0.73 | 1.13 | 1.39 | 1.70 | 4.58 |
| ROE | 340.00 | NaN | NaN | NaN | 39.60 | 96.55 | 1.00 | 9.75 | 15.00 | 27.00 | 917.00 |
| Cash Ratio | 340.00 | NaN | NaN | NaN | 70.02 | 90.42 | 0.00 | 18.00 | 47.00 | 99.00 | 958.00 |
| Net Cash Flow | 340.00 | NaN | NaN | NaN | 55537620.59 | 1946365312.18 | -11208000000.00 | -193906500.00 | 2098000.00 | 169810750.00 | 20764000000.00 |
| Net Income | 340.00 | NaN | NaN | NaN | 1494384602.94 | 3940150279.33 | -23528000000.00 | 352301250.00 | 707336000.00 | 1899000000.00 | 24442000000.00 |
| Earnings Per Share | 340.00 | NaN | NaN | NaN | 2.78 | 6.59 | -61.20 | 1.56 | 2.90 | 4.62 | 50.09 |
| Estimated Shares Outstanding | 340.00 | NaN | NaN | NaN | 577028337.75 | 845849595.42 | 27672156.86 | 158848216.10 | 309675137.80 | 573117457.32 | 6159292035.00 |
| P/E Ratio | 340.00 | NaN | NaN | NaN | 32.61 | 44.35 | 2.94 | 15.04 | 20.82 | 31.76 | 528.04 |
| P/B Ratio | 340.00 | NaN | NaN | NaN | -1.72 | 13.97 | -76.12 | -4.35 | -1.07 | 3.92 | 129.06 |
Questions:
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (15,10))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="purple"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="red", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature,xlabel,ylabel, perc=True, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
if count <= 3:
plt.figure(figsize=(count + 4, count + 4))
elif count>3 and count <= 15:
plt.figure(figsize=(count + 2, count + 3 ))
else:
plt.figure(figsize=(count + 200 , count + 300))
else:
if n <= 15:
plt.figure(figsize=(n + 2, count))
else:
plt.figure(figsize=(n + 20, count))
if count<=3:
plt.xticks(rotation=65,ha='right', fontsize=count*6,fontweight='bold')# Adjust the fonts of x-axis ticks
plt.yticks(fontsize=count*4,fontweight='bold')
elif count>3 and count<=5:
plt.xticks(rotation=65,h='right', fontsize=count*4,fontweight='bold')# Adjust the fonts of x-axis ticks
plt.yticks(fontsize=count*2.5,fontweight='bold') # Adjust the fontsize for y-axis ticks as needed
elif count>5 and count<=12:
plt.xticks(rotation=65,ha='right', fontsize=count*2,fontweight='bold')# Adjust the fonts of x-axis ticks
plt.yticks(fontsize=count*2.5,fontweight='bold') # Adjust the fontsize for y-axis ticks as needed
else:
plt.xticks(rotation=65,ha='right', fontsize=count*1,fontweight='bold')# Adjust the fonts of x-axis ticks
plt.yticks(fontsize=count*1.3,fontweight='bold') # Adjust the fontsize for y-axis ticks as needed
ax = sns.countplot(
data=data,
x=feature,
palette="Set1",
order=data[feature].value_counts().index[:n],
)
ax.set_xlabel(xlabel)
if count<=3:
plt.xlabel(xlabel,rotation=0, fontsize=count*6)# Adjust the fonts of x-axis label
elif count>3 and count<=5:
plt.xlabel(xlabel,rotation=0, fontsize=count*4)# Adjust the fonts of x-axis label
elif count>5 and count<=12:
plt.xlabel(xlabel,rotation=0, fontsize=count*2)# Adjust the fonts of x-axis label
else:
plt.xlabel(xlabel,rotation=0, fontsize=count*1.2)# Adjust the fonts of x-axis label
ax.set_ylabel(ylabel)
if count<=3:
plt.ylabel(ylabel,rotation=90, fontsize=count*6)# Adjust the fonts of y-axis label
elif count>3 and count<=5:
plt.ylabel(ylabel,rotation=90, fontsize=count*4)# Adjust the fonts of y-axis label
elif count>5 and count<=12:
plt.ylabel(ylabel,rotation=90, fontsize=count*2)# Adjust the fonts of y-axis label
else:
plt.ylabel(ylabel,rotation=90, fontsize=count*1.2)# Adjust the fonts of y-axis label
for patch in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * patch.get_height() / total
) # percentage of each class of the category
else:
label = patch.get_height() # count of each level of the category
x = patch.get_x() + patch.get_width() / 2 # width of the plot
y = patch.get_height() # height of the plot
if count<= 15: # Condition for less than or equal to 15 bars
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=16,
xytext=(0, 9),
textcoords="offset points",
) # annotate the percentage
elif count > 15 and count <= 25: # Condition for 15 to 25 bars
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=20,
xytext=(0, 9),
textcoords="offset points",
) # annotate the percentage
else: # Condition for more than 25 bars
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=35,
xytext=(0, 9),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
histogram_boxplot(df,'Current Price') # Writing the code to create histogram_boxplot for 'Current Price'
histogram_boxplot(df,'Price Change') # Writing the code to create histogram_boxplot for 'Price Change'
histogram_boxplot(df,'Volatility') # Writing the code to create histogram_boxplot for 'Volatility'
histogram_boxplot(df,'ROE') # Writing the code to create histogram_boxplot for 'ROE'
histogram_boxplot(df,'Cash Ratio') # Writing the code to create histogram_boxplot for 'Cash Ratio'
histogram_boxplot(df,'Net Cash Flow') # Writing the code to create histogram_boxplot for 'Net Cash Flow'
histogram_boxplot(df,'Net Income') # Writing the code to create histogram_boxplot for 'Net Income'
histogram_boxplot(df,'Earnings Per Share') # Writing the code to create histogram_boxplot for 'Earnings Per Share'
histogram_boxplot(df,'Estimated Shares Outstanding') # Writing the code to create histogram_boxplot for 'Estimated Shares Outstanding'
histogram_boxplot(df,'P/E Ratio') # Writing the code to create histogram_boxplot for 'P/E Ratio'
histogram_boxplot(df,'P/B Ratio') # Writing the code to create histogram_boxplot for 'P/B Ratio'
labeled_barplot(df,'GICS Sector',xlabel='GICS Sector',ylabel='Count')# Writing the code to create labeled_barplot for 'GICS Sector'
#display the five sectors with the most number of stocks
df["GICS Sector"].value_counts().head(n=5)
Industrials 53 Financials 49 Consumer Discretionary 40 Health Care 40 Information Technology 33 Name: GICS Sector, dtype: int64
labeled_barplot(df,'GICS Sub Industry',xlabel='GICS Sub Industry',ylabel='Count') # Writing the code to create labeled_barplot for 'GICS Sub Industry'